The main goal of this report is to determine which symptoms of COVID-19 are most likely to lead to death. To that end, a dataset was downloaded and used in the following analysis. The process of gathering this data is described in this article.
The analysis consists of a short data characteristics section followed by four main parts. The first part attempts to determine which attributes are most strongly correlated with the outcome of the patient (dead or alive). The second part presents an interactive plot for each of these attributes, together with a short explanation of what the attributes mean biologically and why they may be correlated with the outcome. The third part is a classification attempt: based on the most correlated attributes, a random forest classifier is trained to assign patients to one of two groups (dead or alive). The final accuracy of the model is 97%, despite being based on only five of the 78 numeric attributes in the original dataset. The last part shows which variables were the most important in the classification process. Lactate dehydrogenase had the biggest impact, with an importance value of 74.6, while the second most important variable was high-sensitivity C-reactive protein, with a value more than 2 times smaller (27.2). This outcome corresponds to the article linked earlier, since the same attributes were used in the analysis performed by its author, which suggests that the findings of this report are meaningful.
## [1] "kableExtra" "caret" "lattice" "crosstalk" "corrplot"
## [6] "corrr" "openxlsx" "plotly" "ggplot2" "formattable"
## [11] "tidyr" "dplyr" "knitr" "stats" "graphics"
## [16] "grDevices" "utils" "datasets" "methods" "base"
The provided data is organized so that each patient has several rows. Each row describes a single point in time at which a certain group of parameters was measured. Because of this, the data contains many NA values both row-wise and column-wise (not every parameter was measured during a single examination).
| Rows in the dataset | Columns in the dataset | Numeric attributes | First admission | Last discharge |
|---|---|---|---|---|
| 6120 | 84 | 78 | 2020-01-10 15:52:20 | 2020-03-04 16:21:51 |
| Gender | Number of cases |
|---|---|
| Male | 224 |
| Female | 151 |

| Attribute | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s |
|---|---|---|---|---|---|---|---|
| PATIENT_ID | Min. : 1.0 | 1st Qu.: 92.0 | Median :185.0 | Mean :184.8 | 3rd Qu.:270.0 | Max. :375.0 | |
| age | Min. :18.00 | 1st Qu.:47.00 | Median :62.00 | Mean :59.44 | 3rd Qu.:71.00 | Max. :95.00 | |
| gender | Min. :1.000 | 1st Qu.:1.000 | Median :1.000 | Mean :1.391 | 3rd Qu.:2.000 | Max. :2.000 | |
| Admission_time | Min. :2020-01-10 15:52:20 | 1st Qu.:2020-02-01 00:06:16 | Median :2020-02-04 15:53:12 | Mean :2020-02-03 18:57:56 | 3rd Qu.:2020-02-09 02:06:58 | Max. :2020-02-17 21:30:07 | |
| Discharge_time | Min. :2020-01-23 09:09:23 | 1st Qu.:2020-02-13 19:06:26 | Median :2020-02-17 21:50:30 | Mean :2020-02-16 21:40:09 | 3rd Qu.:2020-02-19 13:30:26 | Max. :2020-03-04 16:21:51 | |
| outcome | Min. :0.0000 | 1st Qu.:0.0000 | Median :0.0000 | Mean :0.4747 | 3rd Qu.:1.0000 | Max. :1.0000 | |
| Hypersensitive_cardiac_troponinI | Min. : 1.9 | 1st Qu.: 4.4 | Median : 20.6 | Mean : 1223.2 | 3rd Qu.: 223.8 | Max. :50000.0 | NA’s :5613 |
| hemoglobin | Min. : 6.4 | 1st Qu.:113.0 | Median :125.0 | Mean :123.1 | 3rd Qu.:137.0 | Max. :178.0 | NA’s :5145 |
| Serum_chloride | Min. : 71.50 | 1st Qu.: 99.05 | Median :102.10 | Mean :103.14 | 3rd Qu.:105.65 | Max. :140.40 | NA’s :5145 |
| Prothrombin_time | Min. : 11.50 | 1st Qu.: 13.60 | Median : 14.80 | Mean : 16.68 | 3rd Qu.: 16.70 | Max. :120.00 | NA’s :5458 |
| procalcitonin | Min. : 0.020 | 1st Qu.: 0.040 | Median : 0.100 | Mean : 1.107 | 3rd Qu.: 0.405 | Max. :57.170 | NA’s :5661 |
| eosinophils… | Min. :0.000 | 1st Qu.:0.000 | Median :0.100 | Mean :0.629 | 3rd Qu.:0.800 | Max. :8.600 | NA’s :5163 |
| Interleukin_2_receptor | Min. : 61.0 | 1st Qu.: 459.5 | Median : 676.5 | Mean : 907.2 | 3rd Qu.:1155.5 | Max. :7500.0 | NA’s :5852 |
| Alkaline_phosphatase | Min. : 17.00 | 1st Qu.: 54.00 | Median : 69.50 | Mean : 82.47 | 3rd Qu.: 95.00 | Max. :620.00 | NA’s :5190 |
| albumin | Min. :13.60 | 1st Qu.:27.40 | Median :32.20 | Mean :32.01 | 3rd Qu.:36.60 | Max. :48.60 | NA’s :5186 |
| basophil… | Min. :0.00 | 1st Qu.:0.10 | Median :0.20 | Mean :0.21 | 3rd Qu.:0.30 | Max. :1.70 | NA’s :5163 |
| Interleukin_10 | Min. : 5.00 | 1st Qu.: 5.00 | Median : 5.90 | Mean : 16.07 | 3rd Qu.: 12.35 | Max. :1000.00 | NA’s :5853 |
| Total_bilirubin | Min. : 2.50 | 1st Qu.: 7.40 | Median : 10.70 | Mean : 16.70 | 3rd Qu.: 16.77 | Max. :505.70 | NA’s :5190 |
| Platelet_count | Min. : -1.0 | 1st Qu.:109.0 | Median :178.0 | Mean :184.3 | 3rd Qu.:248.0 | Max. :558.0 | NA’s :5163 |
| monocytes… | Min. : 0.300 | 1st Qu.: 2.800 | Median : 5.700 | Mean : 6.155 | 3rd Qu.: 8.600 | Max. :53.000 | NA’s :5162 |
| antithrombin | Min. : 20.00 | 1st Qu.: 74.00 | Median : 86.00 | Mean : 85.32 | 3rd Qu.: 97.00 | Max. :136.00 | NA’s :5790 |
| Interleukin_8 | Min. : 5.000 | 1st Qu.: 8.675 | Median : 16.000 | Mean : 83.088 | 3rd Qu.: 35.200 | Max. :6795.000 | NA’s :5852 |
| indirect_bilirubin | Min. : 0.100 | 1st Qu.: 3.800 | Median : 5.400 | Mean : 6.889 | 3rd Qu.: 8.000 | Max. :145.100 | NA’s :5214 |
| Red_blood_cell_distribution_width | Min. :10.60 | 1st Qu.:12.00 | Median :12.60 | Mean :13.07 | 3rd Qu.:13.70 | Max. :27.10 | NA’s :5197 |
| neutrophils_percent | Min. : 1.7 | 1st Qu.:65.1 | Median :82.4 | Mean :77.6 | 3rd Qu.:92.3 | Max. :98.9 | NA’s :5163 |
| total_protein | Min. :31.80 | 1st Qu.:61.00 | Median :65.90 | Mean :65.30 | 3rd Qu.:70.45 | Max. :88.70 | NA’s :5189 |
| Quantification_of_Treponema_pallidum_antibodies | Min. : 0.020 | 1st Qu.: 0.040 | Median : 0.050 | Mean : 0.132 | 3rd Qu.: 0.070 | Max. :11.950 | NA’s :5841 |
| Prothrombin_activity | Min. : 6.00 | 1st Qu.: 65.00 | Median : 81.00 | Mean : 78.55 | 3rd Qu.: 95.00 | Max. :142.00 | NA’s :5461 |
| HBsAg | Min. : 0.000 | 1st Qu.: 0.000 | Median : 0.010 | Mean : 8.306 | 3rd Qu.: 0.010 | Max. :250.000 | NA’s :5841 |
| mean_corpuscular_volume | Min. : 61.60 | 1st Qu.: 86.90 | Median : 90.10 | Mean : 90.39 | 3rd Qu.: 93.90 | Max. :118.90 | NA’s :5163 |
| hematocrit | Min. :14.50 | 1st Qu.:33.50 | Median :36.60 | Mean :36.55 | 3rd Qu.:39.90 | Max. :52.30 | NA’s :5163 |
| White_blood_cell_count | Min. : 0.13 | 1st Qu.: 4.94 | Median : 7.72 | Mean : 15.60 | 3rd Qu.: 12.72 | Max. :1726.60 | NA’s :4993 |
| Tumor_necrosis_factorα | Min. : 4.00 | 1st Qu.: 6.70 | Median : 8.60 | Mean : 11.58 | 3rd Qu.: 11.50 | Max. :168.00 | NA’s :5852 |
| mean_corpuscular_hemoglobin_concentration | Min. :286.0 | 1st Qu.:333.0 | Median :343.0 | Mean :342.8 | 3rd Qu.:350.0 | Max. :514.0 | NA’s :5163 |
| fibrinogen | Min. : 0.500 | 1st Qu.: 3.050 | Median : 4.120 | Mean : 4.294 | 3rd Qu.: 5.480 | Max. :10.780 | NA’s :5554 |
| Interleukin_1β | Min. : 5.00 | 1st Qu.: 5.00 | Median : 5.00 | Mean : 6.51 | 3rd Qu.: 5.00 | Max. :88.50 | NA’s :5852 |
| Urea | Min. : 0.800 | 1st Qu.: 4.000 | Median : 5.985 | Mean : 9.589 | 3rd Qu.:11.400 | Max. :68.400 | NA’s :5184 |
| lymphocyte_count | Min. : 0.000 | 1st Qu.: 0.460 | Median : 0.800 | Mean : 1.017 | 3rd Qu.: 1.310 | Max. :52.420 | NA’s :5163 |
| PH_value | Min. :5.000 | 1st Qu.:6.000 | Median :6.500 | Mean :6.484 | 3rd Qu.:7.294 | Max. :7.565 | NA’s :5736 |
| Red_blood_cell_count | Min. : 0.100 | 1st Qu.: 3.680 | Median : 4.140 | Mean : 9.288 | 3rd Qu.: 4.650 | Max. :749.500 | NA’s :4993 |
| Eosinophil_count | Min. :0.000 | 1st Qu.:0.000 | Median :0.010 | Mean :0.039 | 3rd Qu.:0.060 | Max. :0.490 | NA’s :5163 |
| Corrected_calcium | Min. :1.650 | 1st Qu.:2.270 | Median :2.360 | Mean :2.355 | 3rd Qu.:2.440 | Max. :2.790 | NA’s :5206 |
| Serum_potassium | Min. : 2.760 | 1st Qu.: 3.950 | Median : 4.410 | Mean : 4.509 | 3rd Qu.: 4.870 | Max. :12.800 | NA’s :5140 |
| glucose | Min. : 1.000 | 1st Qu.: 5.550 | Median : 6.990 | Mean : 8.889 | 3rd Qu.:10.260 | Max. :43.010 | NA’s :5345 |
| neutrophils_count | Min. : 0.06 | 1st Qu.: 3.09 | Median : 5.85 | Mean : 7.81 | 3rd Qu.:10.95 | Max. :33.88 | NA’s :5163 |
| Direct_bilirubin | Min. : 1.600 | 1st Qu.: 3.225 | Median : 4.800 | Mean : 9.887 | 3rd Qu.: 8.275 | Max. :360.600 | NA’s :5190 |
| Mean_platelet_volume | Min. : 8.50 | 1st Qu.:10.10 | Median :10.80 | Mean :10.91 | 3rd Qu.:11.50 | Max. :15.00 | NA’s :5258 |
| ferritin | Min. : 17.8 | 1st Qu.: 377.2 | Median : 711.0 | Mean : 1379.1 | 3rd Qu.: 1425.2 | Max. :50000.0 | NA’s :5837 |
| RBC_distribution_width_SD | Min. : 31.30 | 1st Qu.: 38.50 | Median : 40.90 | Mean : 42.44 | 3rd Qu.: 44.70 | Max. :113.30 | NA’s :5197 |
| Thrombin_time | Min. : 13.00 | 1st Qu.: 15.60 | Median : 16.80 | Mean : 18.17 | 3rd Qu.: 18.38 | Max. :161.90 | NA’s :5554 |
| lymphocyte_percent | Min. : 0.000 | 1st Qu.: 3.925 | Median :11.450 | Mean :15.392 | 3rd Qu.:24.975 | Max. :60.000 | NA’s :5162 |
| HCV_antibody_quantification | Min. :0.020 | 1st Qu.:0.040 | Median :0.060 | Mean :0.117 | 3rd Qu.:0.090 | Max. :2.090 | NA’s :5841 |
| D.D_dimer | Min. : 0.210 | 1st Qu.: 0.603 | Median : 2.155 | Mean : 7.943 | 3rd Qu.:21.000 | Max. :60.000 | NA’s :5490 |
| Total_cholesterol | Min. :0.100 | 1st Qu.:3.010 | Median :3.630 | Mean :3.689 | 3rd Qu.:4.265 | Max. :7.300 | NA’s :5189 |
| aspartate_aminotransferase | Min. : 6.00 | 1st Qu.: 19.50 | Median : 27.00 | Mean : 46.53 | 3rd Qu.: 42.00 | Max. :1858.00 | NA’s :5185 |
| Uric_acid | Min. : 43.0 | 1st Qu.: 183.2 | Median : 243.7 | Mean : 276.1 | 3rd Qu.: 333.8 | Max. :1176.0 | NA’s :5186 |
| HCO3. | Min. : 6.30 | 1st Qu.:21.00 | Median :23.50 | Mean :23.14 | 3rd Qu.:25.90 | Max. :36.30 | NA’s :5186 |
| calcium | Min. :1.170 | 1st Qu.:1.980 | Median :2.080 | Mean :2.078 | 3rd Qu.:2.190 | Max. :2.620 | NA’s :5141 |
| Amino.terminal_brain_natriuretic_peptide_precursor.NT.proBNP. | Min. : 5 | 1st Qu.: 150 | Median : 585 | Mean : 3669 | 3rd Qu.: 2625 | Max. :70000 | NA’s :5645 |
| Lactate_dehydrogenase | Min. : 110.0 | 1st Qu.: 218.0 | Median : 340.0 | Mean : 474.2 | 3rd Qu.: 601.8 | Max. :1867.0 | NA’s :5186 |
| platelet_large_cell_ratio | Min. :11.20 | 1st Qu.:25.60 | Median :30.90 | Mean :31.77 | 3rd Qu.:37.20 | Max. :62.20 | NA’s :5258 |
| Interleukin_6 | Min. : 1.500 | 1st Qu.: 4.772 | Median : 19.265 | Mean : 112.308 | 3rd Qu.: 60.167 | Max. :5000.000 | NA’s :5848 |
| Fibrin_degradation_products | Min. : 4.00 | 1st Qu.: 4.00 | Median : 17.90 | Mean : 61.35 | 3rd Qu.:150.00 | Max. :190.80 | NA’s :5790 |
| monocytes_count | Min. : 0.010 | 1st Qu.: 0.270 | Median : 0.410 | Mean : 0.526 | 3rd Qu.: 0.580 | Max. :39.920 | NA’s :5163 |
| PLT_distribution_width | Min. : 8.00 | 1st Qu.:11.10 | Median :12.40 | Mean :13.01 | 3rd Qu.:14.30 | Max. :25.30 | NA’s :5258 |
| globulin | Min. :10.10 | 1st Qu.:29.70 | Median :32.70 | Mean :33.24 | 3rd Qu.:36.50 | Max. :50.60 | NA’s :5190 |
| γ.glutamyl_transpeptidase | Min. : 3.00 | 1st Qu.: 22.00 | Median : 34.00 | Mean : 55.34 | 3rd Qu.: 58.00 | Max. :732.00 | NA’s :5190 |
| International_standard_ratio | Min. : 0.840 | 1st Qu.: 1.030 | Median : 1.140 | Mean : 1.313 | 3rd Qu.: 1.330 | Max. :13.480 | NA’s :5461 |
| basophil_count… | Min. :0.000 | 1st Qu.:0.010 | Median :0.010 | Mean :0.017 | 3rd Qu.:0.020 | Max. :0.120 | NA’s :5163 |
| X2019.nCoV_nucleic_acid_detection | Min. :-1 | 1st Qu.:-1 | Median :-1 | Mean :-1 | 3rd Qu.:-1 | Max. :-1 | NA’s :5619 |
| mean_corpuscular_hemoglobin | Min. :20.4 | 1st Qu.:29.7 | Median :30.9 | Mean :31.0 | 3rd Qu.:32.2 | Max. :50.8 | NA’s :5163 |
| Activation_of_partial_thromboplastin_time | Min. : 21.80 | 1st Qu.: 35.30 | Median : 39.20 | Mean : 41.52 | 3rd Qu.: 44.12 | Max. :144.00 | NA’s :5552 |
| hsCRP | Min. : 0.10 | 1st Qu.: 5.70 | Median : 51.50 | Mean : 76.24 | 3rd Qu.:118.50 | Max. :320.00 | NA’s :5383 |
| HIV_antibody_quantification | Min. :0.05 | 1st Qu.:0.07 | Median :0.09 | Mean :0.10 | 3rd Qu.:0.11 | Max. :0.27 | NA’s :5842 |
| serum_sodium | Min. :115.4 | 1st Qu.:137.7 | Median :140.4 | Mean :141.6 | 3rd Qu.:143.5 | Max. :179.7 | NA’s :5145 |
| thrombocytocrit | Min. :0.010 | 1st Qu.:0.150 | Median :0.210 | Mean :0.212 | 3rd Qu.:0.270 | Max. :0.510 | NA’s :5258 |
| ESR | Min. : 1.00 | 1st Qu.: 14.00 | Median : 28.00 | Mean : 33.69 | 3rd Qu.: 45.50 | Max. :110.00 | NA’s :5737 |
| glutamic.pyruvic_transaminase | Min. : 5.00 | 1st Qu.: 16.00 | Median : 24.00 | Mean : 38.86 | 3rd Qu.: 41.00 | Max. :1600.00 | NA’s :5189 |
| eGFR | Min. : 2.00 | 1st Qu.: 63.58 | Median : 87.90 | Mean : 81.56 | 3rd Qu.:103.97 | Max. :224.00 | NA’s :5184 |
| creatinine | Min. : 11.00 | 1st Qu.: 58.00 | Median : 76.00 | Mean : 109.93 | 3rd Qu.: 98.25 | Max. :1497.00 | NA’s :5184 |
| Measurement_time | Min. :2020-01-10 19:45:00 | 1st Qu.:2020-02-04 13:44:00 | Median :2020-02-09 12:42:30 | Mean :2020-02-08 07:00:02 | 3rd Qu.:2020-02-13 10:34:00 | Max. :2020-02-18 17:49:00 | NA’s :14 |
| outcome_text | Alive:3215 | Dead :2905 | | | | | |
| Gender | Male :3730 | Female:2390 | | | | | |
| Normalized_time | Min. : 0.00 | 1st Qu.: 1.25 | Median : 56.85 | Mean : 98.71 | 3rd Qu.:167.29 | Max. :524.25 | NA’s :14 |
To create a correlation matrix, all measurements of each patient have to be aggregated into a single row, so an aggregation method must be chosen for columns containing more than one value. In the following block three different data frames are created, each using a different aggregation method: mean, max and last. The “last” method means that only the most recent measurement is taken into consideration. These data frames are then used to create three correlation data frames with the corrr package, which makes it possible to skip the step of creating a correlation matrix and converting it into a data frame. In the following blocks and explanations I will refer to these three methods as the “mean”, “max” and “last” correlations.
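The report's own R code is not reproduced here, so the following is only a minimal Python sketch of the aggregation idea described above. The function name `aggregate_per_patient`, the dictionary-based row layout and the sample values are assumptions made for illustration; only the column names and the three methods (mean, max, last) come from the report.

```python
from collections import defaultdict
from statistics import mean

def aggregate_per_patient(rows, column, how):
    """Collapse repeated measurements of `column` into one value per patient.

    rows: list of dicts with "PATIENT_ID", "Measurement_time" and `column`;
          a missing measurement is None (NA in the report).
    how:  "mean", "max" or "last" (most recent non-missing value).
    """
    series = defaultdict(list)
    # Sort chronologically so that "last" really is the most recent value.
    for row in sorted(rows, key=lambda r: r["Measurement_time"]):
        value = row.get(column)
        if value is not None:
            series[row["PATIENT_ID"]].append(value)
    reducers = {"mean": mean, "max": max, "last": lambda values: values[-1]}
    reducer = reducers[how]
    # Patients without any non-missing value are simply absent from the result.
    return {pid: reducer(values) for pid, values in series.items()}

# Hypothetical toy data, not taken from the dataset.
rows = [
    {"PATIENT_ID": 1, "Measurement_time": "2020-02-01 08:00", "hsCRP": 40.0},
    {"PATIENT_ID": 1, "Measurement_time": "2020-02-03 08:00", "hsCRP": None},
    {"PATIENT_ID": 1, "Measurement_time": "2020-02-05 08:00", "hsCRP": 10.0},
    {"PATIENT_ID": 2, "Measurement_time": "2020-02-02 09:00", "hsCRP": 120.0},
]
by_mean = aggregate_per_patient(rows, "hsCRP", "mean")
by_max = aggregate_per_patient(rows, "hsCRP", "max")
by_last = aggregate_per_patient(rows, "hsCRP", "last")
```

Skipping missing values before aggregating is a design choice of this sketch: with so many NA cells per row, an aggregate that propagated NA would leave almost nothing to correlate.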
The corrr library can “focus” the analysis on a specific attribute, which filters out all correlations that do not involve the selected attribute. This study aims to determine which attributes are associated with which outcome of the disease, so the focused attribute is “outcome”. The results are shown below as bar plots. To keep the plots readable, only correlations higher than 0.6 or lower than -0.6 are shown. Hovering over a bar shows the precise value of the correlation.
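The filtering step can be illustrated with a small, self-contained Python sketch. It is not the corrr implementation: `pearson` and `focus_on_outcome` are hypothetical helper names, the toy columns are invented, and dropping pairs with a missing value is an assumption, since the report does not state how NA values were handled when correlating.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant column
    return cov / sqrt(var_x * var_y)

def focus_on_outcome(table, target="outcome", threshold=0.6):
    """Keep only correlations with `target` whose absolute value exceeds `threshold`."""
    strong = {}
    for name, values in table.items():
        if name == target:
            continue
        r = pearson(values, table[target])
        if r is not None and abs(r) > threshold:
            strong[name] = r
    return strong

# Hypothetical per-patient table (one aggregated value per patient).
table = {
    "outcome": [0, 0, 1, 1],
    "marker_up": [1.0, 2.0, 3.0, 4.0],
    "marker_down": [4.0, 3.0, None, 1.0],
    "noise": [1.0, 2.0, 2.0, 1.0],
}
strong = focus_on_outcome(table)
```

In this toy table `marker_up` and `marker_down` pass the 0.6 threshold (one positively, one negatively), while `noise` has zero correlation with the outcome and is filtered out.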
The correlation plots show that, regardless of the aggregation method, the same group of attributes is most strongly correlated with the outcome. There are some differences, but overall the same attributes appear three times. For that reason the following analysis focuses mostly on neutrophils (percentage), fibrin degradation products (D-dimer is a subtype of these, so it is not included separately), lactate dehydrogenase, high-sensitivity C-reactive protein, calcium, prothrombin activity, albumin and lymphocyte percentage.
Several interactive plots are presented in this section. For visualization purposes the timestamp of each measurement was normalized: it was replaced by the difference between the actual measurement time and the time of the first measurement taken for the given patient. As a result, the Normalized_time variable contains the number of hours that passed since the patient's first examination. This approach makes it possible to visualize and compare the course of a given attribute across many patients on a single plot.
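As a rough illustration of this normalization, the Python sketch below computes hours since each patient's first measurement. The function name `add_normalized_time` and the sample rows are assumptions; the column names `PATIENT_ID`, `Measurement_time` and `Normalized_time` are those of the dataset summary. Rows without a timestamp keep a missing `Normalized_time`, which is consistent with both columns showing the same number of NA values in the summary.

```python
from datetime import datetime

def add_normalized_time(rows):
    """Return copies of `rows` with Normalized_time = hours since the patient's first measurement."""
    first_seen = {}
    for row in rows:
        taken_at = row["Measurement_time"]
        if taken_at is None:
            continue  # a missing timestamp cannot define the baseline
        pid = row["PATIENT_ID"]
        if pid not in first_seen or taken_at < first_seen[pid]:
            first_seen[pid] = taken_at

    result = []
    for row in rows:
        taken_at = row["Measurement_time"]
        if taken_at is None:
            hours = None
        else:
            delta = taken_at - first_seen[row["PATIENT_ID"]]
            hours = delta.total_seconds() / 3600
        result.append({**row, "Normalized_time": hours})
    return result

# Hypothetical rows; the input order does not need to be chronological.
rows = [
    {"PATIENT_ID": 1, "Measurement_time": datetime(2020, 2, 2, 20, 0)},
    {"PATIENT_ID": 1, "Measurement_time": datetime(2020, 2, 1, 8, 0)},
    {"PATIENT_ID": 2, "Measurement_time": datetime(2020, 2, 5, 12, 30)},
    {"PATIENT_ID": 2, "Measurement_time": None},
]
hours = [row["Normalized_time"] for row in add_normalized_time(rows)]
```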
This plot shows extremely chaotic data for the deceased patients. There is practically no trend, and little more can be said about this data except that the hsCRP levels are quite high compared to those of the patients who survived. If only the Alive patients are selected, hsCRP can be seen decreasing over time in almost every case. hsCRP is a blood test that measures the level of inflammation in the body; it is used, for example, to assess the risk of heart disease or stroke. A high hsCRP value means high inflammation, which is consistent with the observation that COVID-19 patients with high hsCRP died.
Fibrin degradation products (FDP) are blood components produced when clots break down. The FDP value is high after any thrombotic event. The chaotic data on the plot might indicate that the patients with high FDP (only those who later died) suffered from some kind of blood dysfunction.
Lactate dehydrogenase is an enzyme present in almost every living cell. High levels (up to 4 times higher in deceased patients than in survivors) can indicate an early stage of a heart attack and are in general a negative prognostic factor.
Lower calcium levels among deceased patients can indicate numerous things; hypocalcemia in particular can lead to several muscle-related problems, such as tetany or even disrupted conduction in cardiac tissue. The effect of low calcium levels has been researched and is described in this article.
Prothrombin is a coagulation factor, which means that its role is to manage the clotting process. Low levels of prothrombin activity are related to fibrin degradation products. The low prothrombin activity that occurred among deceased patients can indicate problems with the clotting process.
Albumin is the main protein in human blood, making up about 60% of all blood proteins. Its main role is to maintain proper oncotic pressure, which prevents water and electrolytes from leaking out of the blood vessels into the tissues. A healthy person should have an albumin level of 30 to 55 mg/ml of blood.
Lymphocytes are, alongside neutrophils, one of the five kinds of white blood cells. Low lymphocyte levels can indicate autoimmune diseases, AIDS or other infectious diseases.
The dataset for the classification problem cannot contain NA values if Random Forest is used as the training method. Because of that, only a few columns were chosen for the classification problem:

* Lymphocyte percentage
* Neutrophils percentage
* High-sensitivity C-reactive protein
* Lactate dehydrogenase
* Albumin
These are attributes that showed the highest correlation with the outcome, as shown in the “Determining the correlation” section.
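The preparation step implied here (keeping only patients with no missing value in the chosen columns, then splitting them into training and testing sets) might look like the Python sketch below. The helper names, the toy patients and the seed are hypothetical. The 70/30 fraction is an assumption inferred from the reported set sizes (247 and 104), and splitting separately within each outcome class is also an assumption; the report does not state how the partition was made.

```python
import random

# Column names as they appear in the dataset summary.
FEATURES = ["lymphocyte_percent", "neutrophils_percent", "hsCRP",
            "Lactate_dehydrogenase", "albumin"]

def complete_cases(patients, columns):
    """Keep only patients that have a value for every selected column."""
    return [p for p in patients if all(p.get(c) is not None for c in columns)]

def stratified_split(patients, label, train_fraction=0.7, seed=0):
    """Split each outcome class separately so both sets keep similar class ratios."""
    rng = random.Random(seed)
    by_class = {}
    for patient in patients:
        by_class.setdefault(patient[label], []).append(patient)
    train, test = [], []
    for members in by_class.values():
        members = members[:]  # do not shuffle the caller's list in place
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test

# Hypothetical patients: 10 Alive, 10 Dead, two of them with a missing value.
patients = []
for i in range(20):
    patient = {"PATIENT_ID": i, "outcome_text": "Alive" if i % 2 else "Dead"}
    patient.update({name: float(i) for name in FEATURES})
    patients.append(patient)
patients[0]["albumin"] = None
patients[1]["hsCRP"] = None

usable = complete_cases(patients, FEATURES)
train, test = stratified_split(usable, "outcome_text")
```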
## Size of the training set: 247
## Size of the testing set: 104
## Random Forest
##
## 247 samples
## 5 predictor
## 2 classes: 'Alive', 'Dead'
##
## No pre-processing
## Resampling: Cross-Validated (2 fold, repeated 5 times)
## Summary of sample sizes: 124, 123, 124, 123, 123, 124, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9684173 0.9363620
## 3 0.9651915 0.9297891
## 5 0.9538618 0.9068729
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
##
## Reference
## Prediction Alive Dead
## Alive 54 1
## Dead 3 46
##
## Accuracy : 0.9615
## 95% CI : (0.9044, 0.9894)
## No Information Rate : 0.5481
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9226
##
## Mcnemar's Test P-Value : 0.6171
##
## Precision : 0.9818
## Recall : 0.9474
## F1 : 0.9643
## Prevalence : 0.5481
## Detection Rate : 0.5192
## Detection Prevalence : 0.5288
## Balanced Accuracy : 0.9630
##
## 'Positive' Class : Alive
##
## Random Forest
##
## 247 samples
## 5 predictor
## 2 classes: 'Alive', 'Dead'
##
## Pre-processing: centered (5), scaled (5)
## Resampling: Cross-Validated (2 fold, repeated 5 times)
## Summary of sample sizes: 124, 123, 124, 123, 123, 124, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 1 0.9877812 0.9630597 0.9517857
## 2 0.9883376 0.9674495 0.9732143
## 3 0.9906038 0.9645083 0.9732143
## 4 0.9877185 0.9630597 0.9571429
## 5 0.9868392 0.9615672 0.9553571
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
## Confusion Matrix and Statistics
##
## Reference
## Prediction Alive Dead
## Alive 55 1
## Dead 2 46
##
## Accuracy : 0.9712
## 95% CI : (0.918, 0.994)
## No Information Rate : 0.5481
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9419
##
## Mcnemar's Test P-Value : 1
##
## Precision : 0.9821
## Recall : 0.9649
## F1 : 0.9735
## Prevalence : 0.5481
## Detection Rate : 0.5288
## Detection Prevalence : 0.5385
## Balanced Accuracy : 0.9718
##
## 'Positive' Class : Alive
##
Accuracy is 1 percentage point higher than before parameter tuning, the Kappa value is 0.02 higher, and the remaining measures are the same or higher than before. Because the Random Forest method reached very high accuracy, no other methods were tested.
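The resampling scheme reported in the model output (2-fold cross-validation repeated 5 times on 247 samples) can be sketched in plain Python as follows. This is not how caret builds its folds: caret may balance folds by class, while this sketch only shuffles indices, and the function name `repeated_kfold` and the seed are assumptions.

```python
import random

def repeated_kfold(n_samples, k=2, repeats=5, seed=0):
    """Yield (train_indices, holdout_indices) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for every repeat
        folds = [order[i::k] for i in range(k)]
        for i, holdout in enumerate(folds):
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, holdout

splits = list(repeated_kfold(247, k=2, repeats=5))
train_sizes = [len(train) for train, _ in splits]
```

With 247 samples and two folds, every model is fitted on 123 or 124 samples and evaluated on the rest, which matches the sample sizes listed in the output. Each candidate `mtry` value is scored by averaging over all ten fits.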
High precision and high recall both mean that the classifier performs well, since it returns few false positives and few false negatives. Failing to detect ill people can, however, be quite problematic, since it could increase the strain on the medical system even more.
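The reported measures can be recomputed directly from the two confusion matrices above. The Python sketch below does this with “Alive” as the positive class, as stated in the output; the function name `binary_metrics` is an assumption, while the counts are read from the matrices (rows are predictions, columns are reference labels).

```python
def binary_metrics(tp, fp, fn, tn):
    """Derive common classification measures from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # also called sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
        "balanced_accuracy": (recall + specificity) / 2,
    }

# Positive class is "Alive": tp = predicted Alive and actually Alive, and so on.
before_tuning = binary_metrics(tp=54, fp=1, fn=3, tn=46)
after_tuning = binary_metrics(tp=55, fp=1, fn=2, tn=46)

for name, value in after_tuning.items():
    print(f"{name}: {value:.4f}")
```

Because “Alive” is the positive class, a false positive here is a patient predicted to survive who actually died (1 case in both models), which is the kind of error that matters most clinically.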
## rf variable importance
##
## Overall
## Lactate_dehydrogenase 74.610
## hsCRP 27.151
## neutrophils_percent 15.245
## lymphocyte_percent 3.140
## albumin 2.119
The trained model shows that lactate dehydrogenase levels have the largest impact on predicting whether a patient will die. High-sensitivity C-reactive protein is less than half as important, and the neutrophils percentage comes third. This outcome is confirmed by the article from which the dataset originates.